# Image Caption Generation
Qwen2.5 VL 7B Instruct Gemlite Ao A8w8
Apache-2.0
This is a multimodal large language model quantized with A8W8, based on Qwen2.5-VL-7B-Instruct, supporting vision and language tasks.
Image-to-Text
Transformers

Q
mobiuslabsgmbh
161
1
Devstral Small Vision 2505 GGUF
Apache-2.0
A vision encoder based on the Mistral Small model; it supports image-to-text generation and is compatible with the llama.cpp framework
Image-to-Text
D
ngxson
777
20
Blip Gqa Ft
MIT
A fine-tuned vision-language model based on Salesforce/blip2-opt-2.7b for visual question answering tasks
Image-to-Text
Transformers

B
phucd
29
0
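A minimal visual question answering sketch following the upstream Salesforce/blip2-opt-2.7b usage; whether this fine-tune keeps the same prompt format and loading path is an assumption, and the image path is a placeholder.

```python
# VQA sketch in the style of the upstream BLIP-2 model card.
import torch
from transformers import Blip2Processor, Blip2ForConditionalGeneration
from PIL import Image

model_id = "Salesforce/blip2-opt-2.7b"  # swap in the fine-tuned repo id if it loads the same way
processor = Blip2Processor.from_pretrained(model_id)
model = Blip2ForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

image = Image.open("kitchen.jpg").convert("RGB")
question = "Question: how many chairs are in the picture? Answer:"
inputs = processor(images=image, text=question, return_tensors="pt").to(model.device, torch.float16)
out = model.generate(**inputs, max_new_tokens=20)
print(processor.decode(out[0], skip_special_tokens=True).strip())
```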
Blip Custom Captioning
BSD-3-Clause
BLIP is a unified vision-language pretraining framework, excelling in vision-language tasks such as image caption generation
Image-to-Text
B
hiteshsatwani
78
0
Gemma 3 12b It Qat 3bit
Other
This is an MLX-format model converted from the Google Gemma 3-12B model, supporting image-text-to-text tasks.
Image-to-Text
Transformers Other

G
mlx-community
65
1
My Model
MIT
GIT is a Transformer-based image-to-text generation model capable of generating descriptive text from input images.
Image-to-Text
PyTorch Supports Multiple Languages
M
anoushhka
87
0
Qwen2 VL 7B Captioner Relaxed GGUF
Apache-2.0
This model is a GGUF format conversion of Qwen2-VL-7B-Captioner-Relaxed, optimized for image-to-text tasks and runnable via tools such as llama.cpp and KoboldCpp.
Image-to-Text English
Q
r3b31
321
1
Llama Joycaption Alpha Two Hf Llava FP8 Dynamic
MIT
This is an FP8 compressed version of the Llama JoyCaption Alpha Two model developed by fancyfeast, implemented using the llm-compressor tool and compatible with the vllm framework.
Image-to-Text English
L
JKCHSTR
248
1
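A hedged sketch of vLLM's multimodal API as it would apply to an FP8 LLaVA-style checkpoint: the repo id and prompt template below are assumptions, and a real run should build the prompt from the model's own chat template.

```python
# Hedged vLLM multimodal sketch; repo id and prompt template are assumptions.
from vllm import LLM, SamplingParams
from PIL import Image

llm = LLM(
    model="JKCHSTR/llama-joycaption-alpha-two-hf-llava-FP8-Dynamic",  # assumed repo id
    max_model_len=4096,
    limit_mm_per_prompt={"image": 1},
)
image = Image.open("photo.jpg").convert("RGB")
# Assumed LLaVA-style template; in practice derive it from the model's chat template.
prompt = "USER: <image>\nWrite a descriptive caption for this image. ASSISTANT:"
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(max_tokens=256, temperature=0.6),
)
print(outputs[0].outputs[0].text)
```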
Blip Image Captioning Large
BSD-3-Clause
A vision-language model pre-trained on the COCO dataset, excelling in generating accurate image descriptions
Image-to-Text
B
drgary
23
1
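For reference, the upstream Salesforce checkpoint that this listing appears to mirror is typically used as follows (the repo id and image path are assumptions).

```python
# Minimal BLIP captioning sketch using the upstream Salesforce checkpoint.
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image

model_id = "Salesforce/blip-image-captioning-large"
processor = BlipProcessor.from_pretrained(model_id)
model = BlipForConditionalGeneration.from_pretrained(model_id)

image = Image.open("beach.jpg").convert("RGB")
# Unconditional captioning; pass text="a photography of" for conditional captioning.
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```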
Florence 2 Base Castollux V0.4
An image caption generation model fine-tuned from microsoft/Florence-2-base, focused on improving description quality and formatting
Image-to-Text
Transformers English

F
PJMixers-Images
23
1
Molmo 7B D 0924 NF4
Apache-2.0
A 4-bit quantized version of Molmo-7B-D-0924 that reduces VRAM usage through the NF4 quantization strategy, suitable for environments with limited VRAM.
Image-to-Text
Transformers

M
Scoolar
1,259
1
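For context, NF4 is the 4-bit NormalFloat scheme from bitsandbytes. A sketch of how it is usually requested at load time is below; the listed repo appears to ship pre-quantized weights, so this on-the-fly config is illustrative only, and the upstream Molmo repo id is assumed.

```python
# Illustrative NF4 load-time quantization config (bitsandbytes).
import torch
from transformers import AutoModelForCausalLM, AutoProcessor, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NormalFloat4 data type
    bnb_4bit_use_double_quant=True,      # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_id = "allenai/Molmo-7B-D-0924"  # assumed upstream base; Molmo needs trust_remote_code
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, trust_remote_code=True, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
```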
Llava Llama3
LLaVA-Llama3 is a multimodal model based on Llama-3, supporting joint processing of images and text.
Image-to-Text
L
chatpig
360
1
Qwen2 VL 7B Captioner Relaxed Q4 K M GGUF
Apache-2.0
This is a GGUF format model converted from the Qwen2-VL-7B-Captioner-Relaxed model, specifically designed for image-to-text tasks.
Image-to-Text English
Q
alecccdd
88
1
Vitucano 1b5 V1
Apache-2.0
ViTucano is a natively Portuguese pre-trained visual assistant that integrates visual understanding and language capabilities, suitable for multimodal tasks.
Image-to-Text
Transformers Other

V
TucanoBR
37
2
Microsoft Git Base
MIT
GIT is a Transformer-based generative image-to-text model capable of converting visual content into textual descriptions.
Image-to-Text Supports Multiple Languages
M
seckmaster
18
0
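A minimal GIT captioning sketch against the upstream microsoft/git-base weights; the repo id and image path are assumptions for this repackaged listing.

```python
# GIT image captioning sketch with the standard Transformers API.
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image

model_id = "microsoft/git-base"  # or a fine-tuned variant such as microsoft/git-base-coco
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

image = Image.open("street.jpg").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values=pixel_values, max_length=50)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```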
BLIP Radiology Model
BLIP is a Transformer-based image captioning model capable of generating natural language descriptions for input images.
Image-to-Text
Transformers

B
daliavanilla
16
0
Vit GPT2 Image Captioning
An image captioning model based on the ViT-GPT2 architecture, capable of generating natural language descriptions for input images.
Image-to-Text
Transformers

V
motheecreator
149
0
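The usual ViT-GPT2 captioning pattern looks roughly like this; the upstream nlpconnect checkpoint id is assumed here, since the listed repo appears to be a derivative of the same architecture.

```python
# ViT encoder + GPT-2 decoder captioning via VisionEncoderDecoderModel.
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer
from PIL import Image

model_id = "nlpconnect/vit-gpt2-image-captioning"  # assumed upstream checkpoint
model = VisionEncoderDecoderModel.from_pretrained(model_id)
image_processor = ViTImageProcessor.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

image = Image.open("dog.jpg").convert("RGB")
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
output_ids = model.generate(pixel_values, max_length=16, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```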
Vit GPT2 Image Captioning Model
An image caption generation model based on the ViT-GPT2 architecture, capable of converting input images into descriptive text
Image-to-Text
Transformers

V
motheecreator
142
0
Moondream Caption
Apache-2.0
A customized small vision model based on Moondream2, fine-tuned specifically for image caption generation tasks
Image-to-Text
Transformers

M
wraps
108
9
Base ZhEn
This model converts image content into textual descriptions and is intended for non-commercial use.
Text Recognition
B
MixTex
50
0
Peacock
Other
The Peacock Model is an Arabic multimodal large language model based on the InstructBLIP architecture, with AraLLaMA as its language model.
Image-to-Text Arabic
P
UBC-NLP
73
1
Llama 3 EZO VLM 1
A Japanese vision-language model based on Llama-3-8B-Instruct, enhanced with additional pretraining and instruction tuning for improved Japanese capabilities
Image-to-Text Japanese
L
AXCXEPT
19
7
Florence 2 Large Ft
MIT
Florence-2 is an advanced vision foundation model developed by Microsoft, employing a prompt-based paradigm to handle various vision and vision-language tasks.
Image-to-Text
Transformers

F
zhangfaen
14
0
Florence 2 SD3 Captioner
Apache-2.0
Florence-2-SD3-Captioner is an image caption generation model based on the Florence-2 architecture, specifically designed for generating high-quality image captions.
Image-to-Text
Transformers Supports Multiple Languages

F
gokaygokay
80.06k
34
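The standard Florence-2 prompting pattern is sketched below; the task token and detail prompt are assumptions carried over from the base model's conventions, and the image path is a placeholder.

```python
# Florence-2 style prompted captioning; task token and prompt are assumed.
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image

model_id = "gokaygokay/Florence-2-SD3-Captioner"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("render.png").convert("RGB")
task = "<DESCRIPTION>"  # assumed; base Florence-2 uses tokens like <CAPTION>, <MORE_DETAILED_CAPTION>
inputs = processor(text=task + "Describe this image in great detail.", images=image, return_tensors="pt")
generated_ids = model.generate(
    input_ids=inputs["input_ids"], pixel_values=inputs["pixel_values"],
    max_new_tokens=256, num_beams=3,
)
raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
parsed = processor.post_process_generation(raw, task=task, image_size=(image.width, image.height))
print(parsed)
```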
Test Push
Apache-2.0
distilvit is an image-to-text model based on a ViT image encoder and a distilled GPT-2 text decoder, capable of generating textual descriptions of images.
Image-to-Text
Transformers

T
tarekziade
17
0
Florence 2 Base Ft
MIT
Florence-2 is an advanced vision foundation model developed by Microsoft, employing a prompt-based approach to handle a wide range of vision and vision-language tasks.
Image-to-Text
Transformers

F
lodestones
14
0
Vit Base Patch16 224 Distilgpt2
Apache-2.0
DistilViT is an image caption generation model based on Vision Transformer (ViT) and distilled GPT-2, capable of converting images into textual descriptions.
Image-to-Text
Transformers

V
tarekziade
17
0
Convllava JP 1.3b 1280
ConvLLaVA-JP is a Japanese vision-language model that supports high-resolution input and can engage in conversations about input images.
Image-to-Text
Transformers Japanese

C
toshi456
31
1
Image Captioning Vit Gpt2 Flick8k
Apache-2.0
This model can convert input images into descriptive text, suitable for image understanding tasks in various scenarios.
Image-to-Text
Transformers

I
pltnhan311
18
0
Final Model
Apache-2.0
This image-to-text model, released under the Apache-2.0 license, converts image content into textual descriptions.
Text Recognition
Transformers

F
goatrider
17
0
Paligemma 3b Ft Scicap 448
PaliGemma is a multi-functional lightweight vision-language model that combines image and text inputs to generate text outputs and supports multiple languages.
Image-to-Text
Transformers

P
google
123
0
Paligemma 3b Ft Scicap 224
PaliGemma is a lightweight vision-language model that combines image and text inputs to generate text outputs, supporting multilingual and multi-task processing.
Image-to-Text
Transformers

P
google
107
0
Paligemma 3b Ft Ocrvqa 896
PaliGemma is a multi-functional lightweight vision-language model that supports image and text input and generates text output, suitable for various vision-language tasks.
Image-to-Text
Transformers

P
google
2,056
14
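A minimal PaliGemma inference sketch; the question wording and image path are assumptions, and the OCR-VQA fine-tune is aimed at questions about text visible in the image.

```python
# PaliGemma visual question answering with the standard Transformers classes.
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image

model_id = "google/paligemma-3b-ft-ocrvqa-896"
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("book_cover.jpg").convert("RGB")
prompt = "What is the title of this book?"  # assumed phrasing
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=30)
# Decode only the generated answer tokens.
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```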
Blip Image Captioning Base Bf16
MIT
This model is a quantized version of Salesforce/blip-image-captioning-base, reducing floating-point precision to bfloat16, cutting memory usage by 50%, and is suitable for image-to-text generation tasks.
Image-to-Text
Transformers

B
gospacedev
20
1
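The bfloat16 conversion behind the claimed memory saving amounts to a dtype argument at load time, roughly as follows (halving parameter memory relative to float32; the upstream Salesforce repo id is assumed).

```python
# Loading the base BLIP captioning weights in bfloat16.
import torch
from transformers import BlipProcessor, BlipForConditionalGeneration

model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base", torch_dtype=torch.bfloat16
)
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
```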
Image Model
This is a Transformers-based image-to-text model; its specific capabilities require further documentation.
Image-to-Text
Transformers

I
Mouwiya
15
0
Heron Chat Git Ja Stablelm Base 7b V1
A vision-language model capable of conversing about input images, supporting Japanese interaction
Image-to-Text
Transformers Japanese

H
turing-motors
54
2
Uform Gen2 Dpo
Apache-2.0
UForm-Gen2-dpo is a small generative vision-language model, aligned for image caption generation and visual question answering tasks through Direct Preference Optimization (DPO) on VLFeedback and LLaVA-Human-Preference-10K preference datasets.
Image-to-Text
Transformers English

U
unum-cloud
3,568
44
Moondream Prompt
Apache-2.0
A fine-tuned version of Moondream2, optimized for image prompt generation. It is a lightweight vision-language model suitable for efficient operation on edge devices.
Image-to-Text
Transformers

M
gokaygokay
162
10
Distilvit
Apache-2.0
A vision-language model based on a ViT image encoder and a distilled GPT-2 text decoder, used for image caption generation tasks
Image-to-Text
Transformers

D
Mozilla
290
19
Git Base Minecraft
MIT
A GIT-based image-to-text model capable of generating image descriptions.
Image-to-Text
Transformers Supports Multiple Languages

G
orzhan
22
0